Why differencing works to detrend a time series

In time-series analysis, detecting and removing trends is crucial for accurate modeling and forecasting. This post explores why differencing works so effectively at detrending time-series data, offering insights into its mechanics and practical implications.

Loading in the data

For this showcase, I'll be using the classic Air Passengers dataset.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

air_passengers = pd.read_csv('./AirPassengers.csv', index_col='Month', parse_dates=True)
air_passengers.head()

Output:


            Passengers
Month
1949-01-01         112
1949-02-01         118
1949-03-01         132
1949-04-01         129
1949-05-01         121

Checking for missing values

air_passengers.info()

Output:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 144 entries, 1949-01-01 to 1960-12-01
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Passengers  144 non-null    int64
dtypes: int64(1)
memory usage: 2.2 KB

There are a total of 144 entries and 144 of them are non-null, so there are no missing values.

Plotting the data

plt.figure(figsize=(10, 6))
plt.plot(air_passengers)
plt.title('Monthly International Airline Passengers')
plt.xlabel('Date')
plt.ylabel('Number of Passengers')
plt.show()

Understanding the trend

Now, there is clearly a positive trend in the time series, but we can't yet tell whether it is linear or something more complex. It looks mostly linear, so that's a reasonable starting point. To begin, we will fit a linear regression to the series and plot both on the same graph.

Performing linear regression

# Create a time index feature
air_passengers['TimeIndex'] = np.arange(len(air_passengers))

# Split into features (X) and target (y)
X = air_passengers[['TimeIndex']] # Double brackets keep X two-dimensional, as scikit-learn expects
y = air_passengers['Passengers']

# Initialize the linear regression model
model = LinearRegression()

# Fit the model to the data
model.fit(X, y)

# Predict the values
predictions = model.predict(X)

Plotting the linear trend

plt.plot(air_passengers.index, y, label='Original Data')

# Plot the linear regression line
plt.plot(air_passengers.index, predictions, color='red', label='Linear Regression Line')

plt.title('Monthly International Airline Passengers with Linear Trend')
plt.xlabel('Date')
plt.ylabel('Number of Passengers')
plt.legend()
plt.show()

The line runs roughly through the center of the data, but it is way off at the start, so the trend most definitely isn't linear. For now, though, let's continue with a linear de-trending. We can de-trend a linearly-trending time series by subtracting the fitted linear trend from the original data.

Why does this detrending work?

Suppose that the series has a linear trend. We can model it as follows:

y_t = α + β·t + ϵ_t

Where α is the y-intercept, β is the slope, and ϵ_t is the random noise at time t.

Our linear regression attempts to find the values for α and β. So, our fitted line would be:

ŷ_t = α_p + β_p·t

where the subscript p stands for "predicted".

Now, if we take the difference between the two:

y_t - ŷ_t = (α - α_p) + (β - β_p)·t + ϵ_t

With a good fit, α_p ≈ α and β_p ≈ β, so both trend terms vanish. We've removed all elements of linearity from the series, and the only thing that remains is the random noise.

Plotting the de-trended graph

detrended = y - predictions

# Plot the detrended series
plt.figure(figsize=(10, 6))
plt.plot(air_passengers.index, detrended, label='Detrended Series')
plt.title('Detrended Airline Passenger Data (Subtracted Linear Trend)')
plt.xlabel('Date')
plt.ylabel('Detrended Value')
plt.legend()
plt.show()

Unfortunately, you can see there is still some pattern within the residuals. This means that the trend was not linear after all, and we need a more complex de-trending function.

De-trending this graph with a quadratic trend

We perform polynomial regression of degree 2 on the series:

# Import the necessary library for generating polynomial features
from sklearn.preprocessing import PolynomialFeatures

# Create a PolynomialFeatures object with degree 2
poly = PolynomialFeatures(degree=2)

# Transform the input features X into polynomial features of degree 2
X_poly = poly.fit_transform(X)

# Perform linear regression on the polynomial features
poly_model = LinearRegression()

poly_model.fit(X_poly, y)

poly_trend = poly_model.predict(X_poly)

# Plot the original data and the polynomial regression
plt.figure(figsize=(10, 6))
plt.plot(air_passengers.index, y, label='Original Data')
plt.plot(air_passengers.index, poly_trend, color='red', label='Polynomial Trend')
plt.title('Polynomial Trend in Airline Passenger Data')
plt.xlabel('Date')
plt.ylabel('Number of Passengers')
plt.legend()
plt.show()

Now this seems to be more accurate. We can perform the same de-trending as before:

poly_detrended = y - poly_trend
plt.figure(figsize=(10, 6))
plt.plot(air_passengers.index, poly_detrended, label='Detrended Series (Polynomial Trend Removed)', color='orange')
plt.title('Detrended Airline Passenger Data (Polynomial Trend Removed)')
plt.xlabel('Date')
plt.ylabel('Detrended Value')
plt.legend()
plt.show()

The reasoning for why this works is essentially the same as in the linear case; only the equations change.

Differencing

Why do we need differencing?

As you saw, there was a lot of trial-and-error in trying to figure out the trend by hand, and we're still not sure we found the right one. What if there is a slight trend hiding in our seemingly random residuals? We can perform the Ljung-Box test to rigorously check whether any pattern remains in the residuals. But if there were a simpler way of removing any trend, our lives would be much easier. And there is: differencing. Differencing simply subtracts from every data point its previous neighbor, as given by the equation:

y′_t = y_t - y_{t-1}

If this doesn't immediately detrend the series, we can perform the same operation again:

y″_t = y′_t - y′_{t-1}

And so on, until our series has been completely detrended.

Performing differencing on our graph

Differencing our series is extremely simple, since pandas already provides us with the functionality:

differenced = y.diff().dropna()

The dropna() call is there because the first element has no previous neighbor to subtract, so diff() returns NaN for it.
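You can see this by peeking at the raw diff; the values follow directly from the head of the table shown earlier:

y.diff().head()

Output:

Month
1949-01-01     NaN
1949-02-01     6.0
1949-03-01    14.0
1949-04-01    -3.0
1949-05-01    -8.0
Name: Passengers, dtype: float64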

Plotting the differenced graph:

# Plot the differenced series
plt.figure(figsize=(10, 6))
plt.plot(differenced, label='First Order Differenced Series')
plt.title('Detrended Airline Passenger Data (First Order Differencing)')
plt.xlabel('Date')
plt.ylabel('Differenced Value')
plt.legend()
plt.show()

You can clearly see that there is no trend left in the differenced series, so we've successfully detrended the graph in just a single line. Again, we can perform the Ljung-Box test to verify this formally.
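As a sketch of that verification (my addition, assuming statsmodels is installed), the Ljung-Box test checks whether the autocorrelations of a series are jointly zero:

from statsmodels.stats.diagnostic import acorr_ljungbox

# Test for remaining autocorrelation in the differenced series at lags up to 10.
# A small p-value suggests some pattern (e.g. seasonality) still remains.
lb_test = acorr_ljungbox(differenced, lags=[10], return_df=True)
print(lb_test)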

Why does differencing work?

For linear trends:

Suppose our time series had only a linear trend. So each data point can be modeled with:

y_t = α + β·t + ϵ_t

Our differencing function looks like:

y′_t = y_t - y_{t-1}

Substituting the linear equation for each of the terms in our differencing equation:

y′_t = (α + β·t + ϵ_t) - (α + β·(t-1) + ϵ_{t-1}) = β + ϵ_t - ϵ_{t-1}

So, we've removed every "t" term from the equation. All we have left is the constant β and two random noise terms: we have successfully detrended the series.
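A quick numeric illustration (my own sketch, not from the original post): differencing a noise-free linear series leaves only the constant slope β:

# A pure linear trend: y_t = 10 + 2t, so beta = 2
linear_series = pd.Series(10 + 2 * np.arange(10))
print(linear_series.diff().dropna().tolist())  # [2.0, 2.0, ..., 2.0]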

For polynomial trends:

Polynomial trends are a bit more complicated. Each data point is modeled with a polynomial function plus a random noise term:

y_t = α + β_1·t + β_2·t² + … + β_n·tⁿ + ϵ_t

If we perform differencing once, i.e. first-order differencing, we get:

y′_t = y_t - y_{t-1}

Substituting the polynomial model for each of the terms in our differencing equation:

y′_t = (α + β_1·t + … + β_n·tⁿ + ϵ_t) - (α + β_1·(t-1) + … + β_n·(t-1)ⁿ + ϵ_{t-1})

Simplifying (the α terms cancel):

y′_t = β_1·(t - (t-1)) + β_2·(t² - (t-1)²) + … + β_n·(tⁿ - (t-1)ⁿ) + ϵ_t - ϵ_{t-1}

Each term here is a difference of powers, which we can expand. Simplifying the first term gives just the constant β_1:

β_1·(t - (t-1)) = β_1

Simplifying the second term gives a polynomial of degree 1:

β_2·(t² - (t-1)²) = β_2·(2t - 1)

And so on: expanding tᵏ - (t-1)ᵏ with the binomial theorem always leaves a polynomial of degree k-1.

So, differencing reduces the degree of the polynomial trend by 1. We can keep differencing again and again until we've removed all polynomial components from the trend.
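To see this degree reduction in action (again, my own sketch), differencing a quadratic series once leaves a linear series, and differencing twice leaves a constant:

# A pure quadratic trend: y_t = t^2
quadratic_series = pd.Series(np.arange(10) ** 2, dtype=float)
print(quadratic_series.diff().dropna().tolist())         # [1.0, 3.0, 5.0, ...] (linear in t)
print(quadratic_series.diff().diff().dropna().tolist())  # [2.0, 2.0, 2.0, ...] (constant)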

Conclusion

Differencing proves to be a valuable technique in the time-series analyst's toolkit. By effectively removing trends, it enhances our ability to uncover underlying patterns and make reliable forecasts. Mastering this method not only improves the accuracy of predictive models but also deepens our understanding of the dynamics hidden within time-series data.